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Abstract 

It is often difficult to quantitatively determine if a new molecular simulation algorithm or 
software properly implements sampling of the desired thermodynamic ensemble. We present 
some simple statistical analysis procedures to allow sensitive determination of whether a de- 
sired thermodynamic ensemble is properly sampled. We demonstrate the utility of these tests 
for model systems and for molecular dynamics simulations in a range of situations, includ- 
ing constant volume and constant pressure simulations, and describe an implementation of the 
tests designed for end users. 

1 Introduction 

Molecular simulations, including both molecular dynamics (MD) and Monte Carlo (MC) tech- 
niques, are powerful tools to study the properties of complex molecular systems. When used to 
specifically study thermodynamics of such systems, rather than dynamics, the primary goal of 
molecular simulation is to generate uncorrelated samples from the appropriate ensemble as effi- 
ciently as possible. This simulation data can then be used to compute thermodynamic properties 
*To whom correspondence should be addressed 
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of interest. Simulations of several different ensembles may be required to simulate some thermo- 
dynamic properties, such as free energy differences between states. An ever-expanding number 
of techniques have been proposed to perform more and more sophisticated sampling from com- 
plex molecular systems using both MD and MC, and new software tools are continually being 
introduced in order to implement these algorithms and to take advantage of advances in hardware 
architecture and programming languages. 

However, it is extremely easy to make subtle errors in both the theoretical development and the 
computer implementation of these advanced sampling algorithms. Such errors can occur because 
of numerical errors in the underlying energy functions, theoretical errors in the proposed algorithm, 
approximations that are too extreme, and the programming bugs that are inevitable when managing 
more and more complicated code bases. 

There are a number of reasons it is difficult to validate a given implementation of an algorithm 
for the proper thermodynamic behavior. First, we lack analytical results for virtually all complex 
molecular systems, and analytically soluble toy problems may not have all of the features that more 
complicated systems of actual research interest may possess. Additionally, molecular simulations 
generate statistical samples from the probability distribution of the system. Most observables there- 
fore require significant simulation time to reduce statistical noise to a level sufficiently low to allow 
conclusive identification of small but potentially significant violations of the sampled ensembles. 

There are of course some aspects of molecular distributions that can and should always be 
checked directly. For example, in an NVE ensemble the total energy should be conserved with 
statistically zero drift. For symplectic integrators with NVE simulations, the RMS error will scale 
with the square of the step size. For an NVT ensemble when the potential energy is independent 
of particle momenta (which is true with the rare exception of systems with magnetic forces), the 
kinetic energy will follow the Maxwell-Boltzmann distribution and the consistency of sampled 
results can be tested against this distribution with standard statistical methods. NVT simulations 
must have an average kinetic energy corresponding to the desired temperature, and NPT simula- 
tions must have the proper average instantaneous pressure computed from the virial and kinetic 
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energy. However, there are no standard tests for proper distribution for the potential energy, which 
greatly complicates Monte Carlo simulations, or for total energy of an arbitrary simulation system. 
Additionally, there are many possible distributions which have the correct average temperature or 
pressure, but do not satisfy the proper Boltzmann probability distributions for our specific ensem- 
ble of interest. 

It is therefore worthwhile to have physically rigorous strategies and tools for assessing whether 
a simulation method is indeed generating samples from the desired distribution in its entirety. Such 
general strategies could help to better answer vital questions such as "Is this thermostat/barostat 
correct?," "How much does a very long time step affect my energy distribution?" and of course, 
"Have I finally got all the bugs out of my code now?" 

2 Theory 

Thermodynamic ensembles all have similar probability distributions with respect to macroscopic 
intensive parameters and microstates, e.g.: 

P(x |/3) oc exp(— /3 H(p : q)) canonical (1) 

P(x,V\P,P)°cexp(-P(H(p,q)+PV)) isobaric-isothermal (2) 

P(x,N\P,p,) oc exp(-/3 - /W,)) grand canonical (3) 

species 

where P(a\b) indicates the probability of a microstate determined by variable or variables a given 
macroscopic parameter or parameters b. Specifically, all have the exponential form exp(— u(x)) 
where x = (p, q, V, N) is the microstate and u(x) is a reduced energy term whose form depends on 
the ensemble. 

This reduced energy term is a generalized function of two types of variables. The first type of 
variables are the degrees of freedom determining the microstates of each ensemble, including the 
positions and velocities of the atoms, but also potentially including the volume of the system V 
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and the number of particles of each of i species in the system iV ; . The second type of variables are 
those determining the ensemble of the physical system, including the temperature T, the pressure 
P, the chemical potentials /!,, and the specific functional form of the Hamiltonian H(p,q). These 
equations, along with the requirement that all microstates with the same value for the generalized 
energy term have the same probability, completely define the thermodynamic ensemble. A general 
test should therefore check as directly as possible that the samples we collect are fully consistent 
with Eqs. 1-3. For simplicity, we will perform an initial derivation of such a test using the canonical 
ensemble, and then generalize the derivation to other ensembles. 

The probability density of observing a specific energy in the canonical ensemble (Eq.l) can be 
written in terms of the density of states Q.(E) = Qxp(S(N,V,E)/ks) as 



where S is the entropy, /3 = (1cbT)~ 1 , kg is Boltzmann's constant, and Q(j5) = J Q.(E)exp(—j5E)dE 
is the canonical partition function, related to the Helmholtz free energy A by A = — /3 1 ln<2- Q 
is a function of /3, but not E, whereas Q. is a function of E, but importantly, not /3. Note that at 
this point, E is specifically the total energy, though we will examine kinetic and potential energies 
separately later on. 

Without specific knowledge of what the density of states Q,(E) is for a particular molecular 
system, no quantity of samples from a single state can identify if the energies indeed have the 
proper distribution. However, if we take the ratio of the probability distributions of two simulations 
performed at different temperatures, hence with two different values of /3, but with otherwise 
identical parameters, the unknown density of states cancels leaving: 



P(E\P) = Q(P)- l £l(E)exp(-l3E) 



(4) 



P(E\Pi) 



exp(-ft£) 

am 



exp(-jBiE) 

G(ft) 

exp([/3 2 A 2 -/3 1 A 1 ]-[/3 2 - J 8 1 ] J B) 



(5) 
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If we take the logarithm of this ratio, we obtain: 

ln Wlfcj = [& A 2-AAi]-[j6 2 -A]£ (6) 

which is of the linear form (Xq + U\E. Note that linear coefficient ai = — Q82 — jSi) is independent 
of the (unknown in general) Helmholtz free energies A 2 and A\. 

This relationship forms the basis of the ensemble validation techniques we present in this paper. 
Similar formulas can be derived for any of the standard thermodynamic ensembles with probability 
distributions of the form e~"w as long as the reduced energy term is linear in conjugate parameters. 
Non-exponential probability distributions are certainly possible to generate in simulations, but are 
much less standard, and so we will not deal directly with them in this study. The same general 
techniques will work if the probability of a given microstate depends only on the energy of the 
microstate. We will call agreement of a simulation with its target distribution as described by Eq. 6 
and its analogs for other ensembles as ensemble consistency. 

There are a number of ways to check if the distribution of samples from a given pair of simu- 
lations satisfies these equations. The most straightforward way starts with binning the energies E 
from both simulations. If the distributions are sufficiently close together to have statistically well- 
defined probabilities at overlapping values of E and we have sufficient data, we can fit the ratio of 
the histogram probabilities to a line in this overlap region. If the slope deviates from — (fa — /3i ) by 
a statistically significant amount, then the data necessarily deviates from a canonical distribution. 
However, deciding quantitatively what constitutes "statistically significant" can be challenging, 
and will be further explored in this paper. 

This test of consistency with Eq. 6 is a necessary test for an algorithm that is consistent with 
the canonical ensemble; if the slope of the probability ratio deviates from the true line, the data 
cannot be consistent with the ensemble. However, the test is not necessarily a sufficient test of 
simulation quality as it does not include any direct test of ergodicity. Specifically, it says nothing 
about whether states with the same energy are sampled with equal probability as is required by 
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statistical mechanics. It also does not say anything about whether there are states that are not 
sampled. We could have sampling consistent with the desired ensemble but trapped in only a small 
portion of the allowed phase space of a system. 

In general, additional tests of convergence or ergodicity are required before the system can 
be guaranteed to be sampled correctly. For example, for molecular dynamics, one could examine 
the kinetic energy of different partitions of the degrees of freedom as can be used to diagnose 
such problems as the "flying ice cube," occurring in some poorly configured simulations when the 
center of mass degrees of freedom are decoupled from other degrees of freedom. 1 However, for 
testing algorithms or code, simple systems that are both sufficiently complicated and general can 
usually be found which will behave ergodically within a reasonable amount of simulation time. 
Therefore, in the rest of this paper, we will analyze systems which are clearly sampled ergodically 
and which have converged ensemble averages of interest, so we will not require any additional 
tests of ergodicity or convergence. 

Having analyzed the potential problems with such ensemble validation analysis, we next ex- 
plore possible methods to quantify deviation from the canonical ensemble using data collected 
from pairs of simulations. 

2.1 Visual Inspection 

We can divide the common energy range of the two simulations into bins (perhaps 20-40, de- 
pending on the amount of data, numbers chosen solely from experience through trial and error). 
Bins need not be equally spaced, though this simplifies the analysis considerably by removing the 
need to correct the probability densities for differing width of bins. It is also greatly simplifies the 
analysis to select bin divisions that are aligned between the two data sets. Bins can be chosen to 
exclude a few points on top and the bottom of each distribution to avoid small sample error and 
zero densities at the extremes. P\(E) and Pi{E) in each bin can then be estimated directly from 
the histograms. We can compute the ratio of these histograms at each value of the energy at the 
centers of the bins, and plot either the ratio, or more cleanly, logarithm of this ratio, as shown in 
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Figure 1: Ensemble validation of water simulations. Validation of the energy distribution of 900 
TIP3P water molecules simulated in the NVT ensemble using the Nose-Hoover algorithm 2 . The 
predicted value and the actual data for both linear (a) and nonlinear (b) fits quantitatively agree. 

Fig. 1. If this logarithm ratio is linear, we have a system that for all qualitative purposes obeys the 
proper equilibrium distribution. 

Qualitatively, if the actual slope of the log energy ratios is below the expected, slope, it means 
that the low /3 (high temperature) simulation samples that particular energy less than it should, 
while if it is above the true line, it means that portion of the distribution is oversampled. A con- 
sistently higher observed slope therefore means that the distribution is narrower than it should be, 
and a lower observed slope means that the distribution is wider than it should be. 

2.2 Quantitative Fitting 

The relationships presented so far are not entirely novel; visual inspection of probability ratios of 
paired temperature replica exchange simulations has been used previously to check that neighbor- 
ing replicas have the proper distributions relative to each other. 3 ' 4 However, there does not appear 
to have been an effort to use this relationship as a general test to quantitatively analyze simulations 
for goodness-of-fit to the putative ensemble distributions. 
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2.2.1 Linear fitting 

To make this ensemble test quantitative, we estimated the error in the occupancy probability p^ of 
each bin i as 8pi c = a/ — Pk) l n ( a standard probability result), and propagate the error in the 
individual bin probabilities into the ratio P2(E)/P\(E)) (a process detailed in Appendix B). If the 
true slope lies consistently outside of the error estimates then it is very likely the simulation is not 
correctly sampling the desired ensemble. Calculation of the histogram uncertainties also allows us 
to perform weighted linear and nonlinear least squares fitting (details also in Appendix B). This 
allows us to includes the effect of small sample error at the extremes of the distribution in our 
fitting. We can use standard error propagation methods to propagate the error in the histogram 
occupancy ratios into the error in the linear parameters. 

2.2.2 Nonlinear fitting 

It is well-known that linearizing a mathematical relationship in order to perform linear least squares 
can introduce bias in the estimation of the parameters. It is therefore often preferable to minimize 
the direct sum of the residuals S r (a) = — f(oc, x,-)) , which is a nonlinear function of a, and 
then propagate the error in the histogram bins into the uncertainties of the components of a. In this 
particular problem, we want to determine the two parameter fit that minimizes the sum of residuals 
5 r (ao> cci) for the function 

S r (ao,ai) = £ 

2.2.3 Maximum likelihood estimates 

Any results from either the linear or nonlinear case may affected by the choice of histogram binning 
we use. In theory, we can vary the number of histogram bins to ensure the answers are not depen- 
dent on the number of bins. However, we can completely eliminate the histogram dependence as 
well as including the data at the tails rather than truncating them by using a maximum likelihood 
approach. A maximum likelihood approach allows us to predict the most likely parameters for a 
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PijEi) 
A(Ei) 



— exp(«o + Oi\Ei) 



(7) 



given statistical model from the data that has been observed. 

Previously, we have used such a maximum likelihood approach to compute the free energy dif- 
ference between forward and reverse work distributions between two thermodynamic states at the 
same temperature, 5 which is equivalent to computing the value of /3 1 ln<2i/<22 with fixed /3. In 
the present case, we now have two parameters in the distribution which we must fit, (Xq = In Q\ jQi 
and d\ = — (jh. ~ Pi)- Applying a maximum likelihood approach along the lines described in the 
paper 5 leads to log likelihood equations: 

Ni N 2 

lnL(a|data) = ^ln/(-a - Cfi^) + ^lnf(<Xo + <XiEj) 
i=i j=i 

where f(x) is the Fermi function f(x) = [1 +exp(— x)] , and where the first sum is over energies 
sampled at temperature T\ and the second sum is over the energies sampled at T2. The most likely 
parameters are the ones which maximize the likelihood function in Eq. 8. This particular function 
can be shown to have no minima and only one maximum, so it will always converge. 

Eq. 8 can be solved by any of the standard techniques for multidimensional optimization as it 
is everywhere concave. There is one minor technicality; clearly, the variance can be minimized to 
zero by setting 0Cq = ot\ = 0, which is not physically consistent with the data. There is therefore an 
additional constraint we must first identify to find a unique minimum. 

In performing this likelihood maximization, we note that although there are four parameters 
explicitly stated, A i,A2, /3i and /32,only twoof them are actually free parameters. Examining Eq. 6, 
we can express the relationship to the physical quantities as (Xq = P2A2 — P1A1 and ai = — Q82 — 
/3i). We also note that Eq. 6 does not allow us to test for /3i and directly, but instead is only 
a function of the difference — j8i, so we must actually treat this as one variable corresponding 
to a single degree of freedom. A simple choice is to treat /3i + as a constant in what amounts 
to a choice of the energy scale. We can therefore set /3 ave = ji^i.user + Pl.user), the user specified 
temperatures. Ai and A2 are the free energies of the system, so there is no physical meaning to 
their absolute value, only their difference. Without loss of generality, we set A\ +Aq = 0, and treat 
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AA = A2 — A\ as our second independent variable. These two choices allow us to then solve for 
unique values of Oo and ct\, rewriting (Xq + <X\E = j8 aV e(^2 — ^1) — (fa — fa)E = /3 ave AA — AftE, 
an expression that explicitly only has two free parameters. 

One downside of using a maximum likelihood analysis is that it does not give a graphical 
representation; it is histogram independent, and so we do not have a histogram that we can plot! 
A linear fit should therefore be performed in conjunction with maximum likelihood analysis to 
quickly visualize the data as a sanity check. 

2.3 Error estimates 

Once we have an estimate of the slope fa — fa , we must ask if the slope deviates from the true value 
with a statistically significant deviation or if the difference more likely due to statistical variation. 
For this, we can turn to error estimation techniques to find a statistically robust approximation for 
the error in fa — fa and to determine if any deviations from the true value are most likely a result 
of statistical noise or actual errors in the simulation. 

For weighted linear least squares, weighted nonlinear least squares, and multiparameter max- 
imum likelihood logistic regression, the analytic asymptotic error estimators for the covariance 
matrix of fitting parameters are all well-known statistical results: 

linear cov(a) = (X T WX)- 1 

nonlinear cov(a) = (J T W~ l J)~ l 

maximum likelihood cov(a)= (Hess(lnL) a ) _1 

In all equations a is the vector of parameters we are estimating. In the first equation, X is the 
(M+ 1) x N matrix with the first column all ones, and the second through (Af + l)th column the 
values of the ./V observations of the M observables. In the second equation J is the Jacobian of 
the model with respect to the vector of parameters, evaluated at observations and the values of 
the parameters minimizing the nonlinear fit. In the last equation, Hess(lnL) is the Hessian of log 
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likelihood with respect to the parameters, and W is a weight matrix consisting of the variances of 
the values of each data point estimated from the histograms. We explore these expressions more 
completely in the Appendices B and C. 

We can also use bootstrap sampling of the original distribution data to generate error estimates, 
which has proven to be a reliable error estimation method for free energy calculations. 6 Although 
more computationally intensive, the total burden is relatively low. For example, it takes only 20 
minutes on a single core of a 2.7 GHz Intel i7 processor to perform 200 bootstrap samples, even 
with 600 000 energy evaluations from each simulation. 

Once we have generated error estimates for our estimates of the parameters, we can ask the 
underlying statistical question of whether deviations from the true result are likely caused by sta- 
tistical error or by errors in the underlying data. In most cases we will have collected enough 
samples that the deviation from the fit should be distributed normally. In this case, we can simply 
compute the standard deviation of the fit parameters and ask how many standard deviations the 
calculated slope fa — fa is from the user specified slope. If this difference is consistently more 
than 2-3 o away from the true value in repeated tests, it indicates that there are likely errors with 
the simulations as the two distributions do not have the relationship that they would have if they 
obeyed a canonical distribution. More sophisticated statistical tests are possible that do not assume 
normality but the straightforward normal assumption appears to work fairly well to diagnose prob- 
lems for all cases presented here. It is important to note that the number of standard deviations a 
number is from the expected result is not necessarily a measure of the size of the error. Instead, it 
is a measure of how certain we are of the error as we may be measuring either a very small error 
with extremely high numerical precision or a large error with little precision. 

2.4 Choosing the parameter gap 

We note that the relationship in Eq. 6 is true for any choice of the temperatures /3i and fa. However, 
if fa and fa are very far apart then the two probability distributions P(E\fa) and P(E\fa) will not 
be well determined over any range of E in any simulation of reasonable length. If, on the other 
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hand /3i = J82, no information can be obtained because the simulations will be statistically identical. 
If the two simulations are not statistically identical there are deeper problems to worry about than 
if the simulations are ensemble consistent! 

Coming in from these two limits, if /3i and are moderately far apart, small-sample noise from 
the extremes of the distribution will make it difficult to determine the deviations from fh. — Pi- 
If j8i and are too close together, even the relatively small statistical noise at the centers of 
the distributions will swamp out the information contained in the very slight difference between 
the user-specified temperature gap and the simulation's actual value for J82 — /3i . There should 
therefore be some ideal range of temperature gaps giving the most statistically clear information 
about deviations from ensemble consistency. We will examine specific choices of this gap for 
different systems in this study. 

2.5 Sampling from the canonical ensemble with a harmonic oscillator 

To study these ensemble validity tests in practice, we first examine a toy model, sampling from a 
D-dimensional harmonic oscillator. We then use this model to demonstrate the use of this method 
to identify simulation errors. 

For a Z)-dimensional harmonic oscillator with an equal spring constant K in each dimension 
and equilibrium location x^q in each direction, the total potential energy of the system is E = 
jKY^=i(xi —Xifi) 2 . The partition function for this model is Q(fi) = ( j^) D ^ 2 , meaning the free 
energy is A(j8) = — (D/2/3) ln[(27r) / (fiK)], and the probability of a given configuration x is 



For this exercise, we set x^o = for all i for simplicity, and choose D = 20. We specifically 
do not choose D = 1, because it can give results that may not be typical for other choices of 
dimensions. For D = 1, the density of states o{E) is constant for this choice of E, i.e. P(E) °c 
exp(— fiE) for all spring constants. Unlike most physical densities of states, in this case E = 
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has nonzero probability for all temperatures, which means samples from all temperatures have 
nonnegligible overlap. Harmonic oscillators with D 3> 1 have Q.(E) = at E = and then rapidly 
increasing as E increases, much more characteristic of more typical physical systems. 

2.5.1 Testing for ensemble validity with a toy system with simulation noise 

Using this toy system, we collect samples with K = 1 and /3 = 1.3 and 0.7 (the specific choice 
of temperature gap is explained later). After generating the samples, we add random noise 8E = 
v \N(0, 1)|, where N(0, 1) is a Gaussian random variate with mean zero and standard deviation 1, 
and v is some small positive constant. The addition of random noise allows us to test the ability of 
the algorithm to identify simple errors in the energy distributions. In each case, we carry out 200 
independent repetitions of the procedure, each time with 500,000 samples from each of the two 
different temperature distributions. We note that this particular type of error means that the data are 
generated with the correct probability, but their energies are stored incorrectly. This pattern might 
not be typical of actual errors observed in molecular simulations, but serves as a useful starting 
point for characterizing the sensitivity of this procedure. The results are shown as a function of 
noise in Table 1, with 0.6 the exact result for — Pi- The average energy of each distribution 
is —^jj^ = M$ or 16.667 in this specific case. We examine the linear, nonlinear, and maximum 
likelihood fits, with the error calculated by the analytical estimates, sample standard deviations 
over 200 repetitions, and bootstrap sampling using 200 bootstrap samples from the first of the 200 
repetitions. 

In all cases in Table 1, bootstrap sampling closely matches the standard sample error from 
200 independent samples, suggesting that bootstrap error estimation is likely to be as effective 
as independent sampling to identify ensemble errors, as was also observed in previous free energy 
calculations. 6 Additionally, the analytical error estimates for linear and maximum likelihood fitting 
closely match the sample standard deviation. This is particularly encouraging because it means that 
single pairs of simulations are enough to calculate error estimates robustly. 

Nonlinear fitting is somewhat less useful, as the nonlinear analytical error estimates appear to 
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Table 1 : All fitting forms are sensitive determinants of noise in the energy. Number of standard 
deviations from the true slope for the given error estimate method are shown in parentheses. All 
fitting forms (linear, nonlinear, maximum likelihood) in combination with all estimators of the 
error (analytic, independent replicas, and bootstrap sampling) are sensitive determinants of noise 
amount (nu) in the energy. Nonlinear fitting is somewhat less useful, as the nonlinear analytical 
error estimate does not match the sample and bootstrap error estimates, in contrast to linear and 
maximum likelihood analytical error estimates. The true fa — fa = 0-6. 





V 


0.0 


0.005 


0.0075 


0.01 


0.02 


analytic 
(single sample) 


linear 
nonlinear 
max. likelihood 


0.6006 ± 0.0012 (0.5) 
0.6028 ±0.0013 (2.3) 
0.6016 ± 0.0012 (1.3) 


0.5955 ± 0.0012 (3.7) 
0.6019 ±0.0012 (1.6) 
0.5973 ± 0.0012 (2.3) 


0.5927 ±0.0012 (6.0) 
0.5969 ±0.0012 (2.5) 
0.5939 ±0.0012 (5.2) 


0.5924 ± 0.0012 (6.3) 
0.5916 ±0.0012 (6.9) 
0.5936 ±0.0012 (5.5) 


0.5841 ±0.0012(13.3) 
0.5899 ± 0.0012 (8.3) 
0.5850 ± 0.0012 (13.0) 


200 replicates 


linear 
nonlinear 
max. likelihood 


0.5991 ±0.0012(0.8) 
0.5991 ±0.0031 (0.3) 
0.6001 ±0.0012(0.1) 


0.5958 ± 0.0012 (3.6) 
0.5964 ±0.0030 (1.2) 
0.5968 ±0.0011 (2.9) 


0.5941 ±0.0012(4.8) 
0.5946 ±0.0031 (1.8) 
0.5950 ± 0.0012(4.2) 


0.5922 ±0.0012 (6.5) 
0.5931 ±0.0031 (2.3) 
0.5932 ±0.0012 (6.0) 


0.5838 ±0.0011 (14.9) 
0.5870 ± 0.0029 (4.4) 
0.5848 ±0.0011 (14.2) 


200 bootstraps 


linear 
nonlinear 
max. likelihood 


0.5999 ±0.0013 (0.0) 
0.6019 ± 0.0034 (0.6) 
0.6013 ±0.0012 (1.1) 


0.5953 ±0.0012 (4.1) 
0.6017 ± 0.0031 (0.6) 
0.5972 ±0.0011 (2.5) 


0.5927 ±0.0012 (6.2) 
0.5940 ± 0.0029 (1.0) 
0.5950 ± 0.0012(5.3) 


0.5922 ±0.0013 (6.0) 
0.5915 ±0.0032 (2.7) 
0.5935 ±0.0013 (5.2) 


0.5839 ± 0.0013 (12.7) 
0.5890 ±0.0031 (3.1) 
0.5848 ± 0.0012 (12.4) 



noticeably underestimate the actual error, as determined by the sample standard deviation over 200 
repetitions. The statistical error in nonlinear fitting is larger than the error in the linear and maxi- 
mum likelihood estimates, possibly because of a magnified effect of small sample errors. However, 
all fitting forms (linear, nonlinear, maximum likelihood) in combination with all estimators of the 
error (analytic, independent replicas, and bootstrap sampling) are relatively sensitive determinants 
of noise in the energy. Deviations of more than 3o occur consistently for v as low as 0.0075, or 
less than 1% of kgT , demonstrating that these errors have become statistically significant. Even 
with v = 0.01, where the slope is between 5 and 7 standard deviations from the true slope, the 
visual difference between estimates becomes virtually unnoticeable, both for the actual distribu- 
tions and the ensemble validation fit, as seen in Fig. 2. The ability to sensitively identify errors 
that cannot be directly visualized demonstrates the utility of this quantitative approach. Overall, 
it appears that maximum likelihood error estimates are the best method to use, as discretization 
errors due to poor histogram choice will not matter. However, linear fitting also appears robust, at 
least for this system. 

The ensemble validation relationship is true for all choices of fa and fa but as discussed, for 
finite numbers of samples, there are problems with choices of fa — fa that are either too large or 
two small. For large slope, small sample error in the tails dominates; for small slopes, the small 
magnitude of the slope becomes difficult to distinguish even at the moderate levels of statistical 
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Figure 2: Model energy distributions and discrimination of error. A small amount of noise 
(0.06% of the average energy) is added to each sample. Such differences affect the distribution 
minimally (a), and are difficult to tell from random noise in the (b) linear fits, but fitting quanti- 
tatively to the distribution reveals that the deviation in the distribution from analytical results is 
5-7 standard deviations beyond that which would be expected. The system is the same as used for 
Table 1 with error scale v = 0.01. 
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Table 2: Optimizing temperature spacing to improve error detection in the distribution. De- 
viation from the correct slope of the log ratio of the energy distributions as a function of increasing 
distance between the two distributions, as measured by the magnitude of — Pi , with fixed noise. 
The ability to discriminate the error in fa. — Pi reaches an optimum at intermediate separation of 
distributions. 



ft 


ft 


ft -ft 


Estimated j32 — ft 


a deviation 


.05 


0.95 


0.1 


0.0993 ± 0.0006 


1.1 


.10 


0.90 


0.2 


0.1981 ±0.0007 


2.7 


.15 


0.85 


0.3 


0.2970 ± 0.0008 


3.9 


.20 


0.80 


0.4 


0.3960 ± 0.0009 


4.7 


.25 


0.75 


0.5 


0.4948 ± 0.0010 


5.2 


.30 


0.70 


0.6 


0.5936 ± 0.0012 


5.4 


.40 


0.60 


0.8 


0.7913 ± 0.0017 


5.1 


.50 


0.50 


1.0 


0.9907 ± 0.0027 


3.5 


.60 


0.40 


1.2 


1.1930 ± 0.0047 


1.5 


.70 


0.30 


1.4 


1.3916 ±0.0100 


0.8 



error occurring near the peaks of the energy distributions. In Table 1, we use a fixed difference 
in temperatures. Can we identify an optimal temperature difference to detect error? For this 
exercise, we select a fixed low level of random error (v = 0.01), and vary the slope — Pi with 
the average j(P\ + P2) fixed at 1, using the analytical estimate from the maximum likelihood 
parameter estimation, and again using 500,000 samples from each distribution. 

For fixed noise in the energy function, we see in Table 2 the number of standard deviations 
from the true slope to the observed slope as a function of energy gap. The ability to discriminate 
the error in P2 — Pi is lower for both very large and very small temperature gaps, though there is a 
relatively broad range near the middle where the sensitivity of the test, measured in the number of 
standard deviations the measured slope is from the true slope, is relatively constant. 

Examining the energy distributions at maximum error discrimination point (/3i = 0.6, J82 = 1.3) 
we find that the difference between the centers of the distributions {\A3ksT — 1 .IksT = 6.6£gT ) is 
approximately equal to the sum of the standard deviations of the distributions (4.5fcsT + l.AkgT = 
6.9ksT). This suggest a general rule-of-thumb that we can maximize the ability to identify errors 
by using temperatures separated by approximately the sum of the standard deviations of the distri- 
butions. The precise value of the difference will not matter particularly in most cases as long as 
we are somewhat near the optimum. With less data, we might err on the side of using a slightly 
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smaller gap to guarantee good overlap in the distributions near E = 0. 

This rule is simply intended as a guideline, as some sources of error might show up preferen- 
tially in the tails and thus require larger temperature gaps to observe, but provides a useful starting 
criteria. One example of a physical system which violates this rule is a 1-D harmonic oscillator, 
which has a constant density of states Cl(E). Although the statistical error does indeed increase 
with decreasing overlap, the slope increases faster, and thus sensitivity to statistical error always 
increases with increasing temperature gap. With fixed error, the sensitivity with noise magnitude 
v = 0.02 increases from less than one standard deviation for fa — fa =0.1 to over five standard 
deviations for fa — fa = 1.8. However, this case is atypical, because the density of states is a con- 
stant with the maximum probability always at E = 0, so that even when the temperatures are very 
different there is still nonnegligible overlap in the distributions. 

To apply this rule, we still need to estimate the standard distributions of the two distributions. 
If we assume the variance in energy (and therefore the value of the heat capacity) does not change 
very much over the relative narrow range of temperature spacings used to perform ensemble vali- 
dation, then the distributions will also be the same. We can estimate the width o of the distribution 
given a known heat capacity Cy. Specifically, Oe = T^Cyks, so that for a temperature gap to result 
in a difference in the centers of the energy distribution of 2ge, we must have 2oe = |f AT = CyAT, 
which reduces to AT /T = ^Ikg/Cy. Alternatively, in many cases it may make the most sense to 
run a short simulation at the "center" temperature | (T\ + 7^) to estimate the variance, and we can 
use the equivalent relationship AT /T = to identify a reasonable temperature gap fa — fa for 
simulations of a specific system. 

2.6 Isobaric-isothermal ensembles 

Our discussion up to this point has been restricted to NVT systems. However, the same prin- 
ciples can also be applied to check the validity of simulations run at constant temperature and 
pressure, and of simulations run at constant temperature and chemical potential. We will analyze 
isobaric-isothermal simulations extensively in this section. We will not examine grand canonical 
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simulations in this paper, though we do include the derivations in Appendix A. 

There are at least three useful ways we can analyze NPT simulations for validity. First, let us 
assume that we have two simulations run at the same pressure but different temperatures. Then the 
microstate probabilities are: 

P(x,V\P,P)=A(l3,P)- l exp(-pE(x)-l3PV) (8) 

Where A(j3,P) is the isothermal-isobaric partition function. We then integrate out configurations 
with fixed instantaneous enthalpy H = E{x) +PV, where x here is shorthand for both position and 
momentum variables, not the entire microstate specification. We then have 

PP 



P(H\P,P) = ^ J j8[E(x)+PV-H]A([5,P)- 1 exp(-[5(E(x)+PV)dxdV 

h 3N- 



P P T Cl'(H,P)A(P,P) 1 exp(-/3//) 



where £l!(H,P) is a density of states counting the number of states with a given value of H = E + 
PV, and is explicitly a function of P, but not /3. The prefactor of fiP comes from the requirement 
to cancel the units in the integral, ignoring factors of N relating to distinguishability of particles, 
which will cancel in the ratio of distributions in all cases. Because both simulations have the same 
pressure, we arrive directly at a new ensemble validation relationship: 

P(H\ M _ ftA(ft,P) (9) 



M P } H J^ P 1 ) = HMh) + \hG 2 -p l G l ]-\p 1 -p l ]H). do 



P(H\p h P) lhA(lh,P 
P(H\P h P 

The exact same ensemble validation statistical tests can be applied with H in place of E and the 
Gibbs free energy (or free enthalpy) G plus a small correction factor in the place of A. 

We can also look at the probability of the volume alone by integrating out the energy E at fixed 
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volume: 



P(V\P,P2) 



P(V\fi,Pi) 




A( J 8,P 1 )- 1 2( J 8,y)exp(-/3/' 1 y) 



A(/3, J P 2 )- 1 e( J 8,y)exp(- J 8P 2 y) 



In 



P(V\P,Pi) 
P(V\P,Pi) 
P(V\P,Pi) 
P(V\P,Pi) 



HP1/P2) + [P(G 2 - Gi)] - [P(P 2 -Pi)V] 



PiMP,Pi) 
P2KP.P1) 



cxp(-P[P 2 -Pi]V) 



(11) 



(12) 



We can then use the same techniques already described with A(j8,Pi) in the place of Q(fi,V), Pi 
and P2 in the place of /3i and j6 2 and j8y in the place of £. 

Finally, we can treat the joint probability distributions with both V and E varying indepen- 
dently: 



We can apply most of the same methods described previously with slight modifications for the 
additional dimensions. For example, when fitting log ratio of the distributions, we must now 
perform a multilinear fit in V and E. Multiple variable nonlinear fitting can also be employed. 
However, in both cases, we can quickly run into numerical problems because of the difficulty of 
populating multidimensional histograms with a limited number of samples, making discretization 
error worse. The maximum likelihood method, which already appears to be the most reliable 
method for estimating single variables, does not require any histograms and thus is free from 
discretization error in any dimension. In examining joint variation in E and V in this study we 
therefore focus on only the maximum likelihood method. 



P(V,E\P2,P2) 

P(V,E\p u Pi) 
P(V,E\p 2 ,P 2 ) 

P(V,E|jBi,A) 



P{y,E\p h P{) 



P(V,E\P2,P 2 ) 




(13) 



In 



HP1P1/P1P2) + [P2G2 ~ PiGi] - [j8 2 -Pi]E- [p 2 P 2 - PiPi] V (14) 
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For maximum likelihood maximization we again need to clarify what the free variables are 
in order to fix the form of the probability distribution. The first two are AG = G 2 — G\, setting 
G\ + G2 = 0, and A/3 = /3 2 — Pi, setting (/3i + /3 2 ) /2 = /3 ave = const, as before. By analogy, we set 
(Pi + i , 2)/2 = Pave, with the variable AP = P\ — P 2 . Both /3 ave and P ave are then set at the averages 
of the applied /3 and P of the two simulations. We then find that: 

(m-m) = ^((A J 8)( J P 2+J P 1 ) + (/3 2 + J 8 1 )(AP)) 

= (A J 8) J P aV e + J 8 aV e(AP) (15) 

The explicit maximum likelihood equations for enthalpy, volume, and joint energy and volume are 
then: 

ln ^!! 2, n! = jSave(AG) - (Aj8)H (16) 



ln W^M = ^ave(AG)-(A J 8) J B-(A/3) J P ave y-(AP)/3 ave y 

= J 8 ave (AG)-(A J 8)( J E + J P ave y)- J 8 ave (AP)y (18) 



omitting the unchanged prefactors involving logarithms of the ratios of the known intensive vari- 
ables j8i, fa, Pi and P2. In general, we can ignore this term because we usually don't care about 
the exact value of the free energy difference AG between the paired simulations, and so therefore 
do not need to break the constant term down into its components. 

2.7 Sampling from the isobaric-isothermal ensemble for a toy problem 

To better understand how to validate the volume ensemble, we examine a toy model sampling 
from a modified harmonic oscillator potential. In this case, the harmonic spring constant is in- 
creased by decreasing system volume in order to add a PV work term to the system. We set the 
harmonic force constant K = (a/V) 2 , and for simplicity set xq = 0. This means that A(.P,/3) = 
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I v Q(V,p)exp(-pV)dV which gives 

P(x,V\fi,P) = a^Pf^cxJ-^-ppv) 

We use the Gibbs sampler 7 to generate configurations from the joint distribution P(E,V) by 
alternating sampling in P(E\V) and P(V\E). To sample randomly from P(x\V), we observe that x 
will always be distributed as a Gaussian, with standard deviation o = ^/K/fi = (V/a)/3~ 1//2 . To 
perform conditional sampling in the system volume dimension, we must sample according to the 

/ a 2 x 2 \ 

conditional distribution P(V \X{) exp ( — j8 -^yj — j8PV ) . This is not a typical continuous probabil- 
ity family so there is no simple formula for generating samples from this distribution. However, we 
note that the distribution is strictly less than M exp (— j8PV), where M is the ratio of the normalizing 
constant for the exponential distribution and the normalizing constant for the exponential plus the 
harmonic term. We can then sample V from the exponential distribution exp(— fiPV) and perform 
rejection sampling to sample from the strictly smaller desired distribution P(V\x). Initially, it ap- 

Ba 2 x 2 

pears that the smaller the difference between the two distributions (i.e. the smaller 2 ' ) is, the 
more efficient the sampling will be. However, because xi is generated from a Gaussian distribution, 
(x 2 ) = ~2~> men the average efficiency reduction factor becomes exp(— j8/2), independent of P or 
a, so the acceptance ratio is only significantly affected by the temperature. 

2.7.1 NPT Model System Results 

For all tests, we generate 250,000 samples from each of the paired distributions. To examine the 
enthalpy, we pick /3i = 2/3, fa = 2, and P\ = Pi = 1, and using the maximum likelihood method 
we estimate fa — j8i = 1 .3341 ± 0.0040, only 0.2 quantiles from the true answer of 4/3 (see Fig. 3a 
for the linear plot). To validate the volume sampling, we pick fa = fa = 1.0 and Pi = 1.3 and 
p 2 = 0.7, and find that faP 2 -P l ) = -0.6013 ± 0.0025, 0.53 quantiles from the true answer of -0.6 
(see Fig. 3b) for the linear plot. Finally, when we examine the joint variation of energy and volume, 
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Figure 3: Validation of distributions for harmonic oscillators with pressure We can accurately 
validate the isothermal-isobaric distributions of enthalpy (a) and volume (b) for our harmonic os- 
cillator toy problem with a volume dependent spring constant. 
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we use Pi = 0.6, j6 2 = 0.8, Pi = 0.8 and P 2 = 1.2, which gives us 0.20035 ±0.00318 for the slope 
(Jfc - ft) and -0.48 129 ±0.00185 for the slope p 2 P 2 -piPi, which are 0.1 and 0.7 quantiles from 
the true answers 0.2 and —0.48, respectively. We see that indeed these equations properly capture 
entropy and volume distributions. 

2.7.2 Picking intervals for enthalpy and volume tests 

In the NPT case with differing temperatures and constant pressure, the instantaneous enthalpy E + 
PV takes the place of the energy and a two standard deviation temperature gap will mean choosing 
temperatures separated by ^Jlk B /Cp, instead of ^j2k B /Cy. In the case of an NPT simulation 



performed with constant temperature and at differing pressures, we want 2ay = AP [jpj ■ We 
can use the distribution of volume fluctuations to find that {jfp^ = ~J^T- We therefore must 
have that |AP| = IkgT /oy, or in terms of the physical measurable isothermal compressibility 



Kt = — y \ jpj > = yly^p. Again, this is a guideline, not a strict rule; short simulations at 
the simulation average can also be useful to identify the spread of the distributions, as the answer 
must only be in the right range. For joint distributions, the analysis is more complicated, but it 
seems reasonable to use 2a in both directions, perhaps erring on the low side to ensure sufficient 
samples. 

3 Molecular systems 

3.1 Kinetic energy and potential energy independently obey the ensemble 
validation equation 

In most molecular systems (for example, ones without applied magnetic fields), the potential en- 
ergy of the system is independent of the velocities and masses of the particles. Thus, the potential 
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and kinetic energy are separable, and we can write: 



P(£ P ot + £kinlj3) = e kl „(/3)- 1 ep 0t ( J 8)- 1 a( J E pot )a( J E km )exp(- J 8£ pot )exp(- J 8£ kin ) 

= [e ki n(/3)- 1 a( J E km )exp(- J 8£ kin )] [e pot (/3)- 1 a( J E pot )exp(-/3£ pot )] 
= P(E pot \P)P(E kin \P) 

The separability of the density of states occurs again because the momenta can be sampled inde- 
pendently of the coordinates. The ensemble validation algorithm is therefore valid for the kinetic 
and potential energies independently as well, so that: 

exp(-[/j 2 -piJ£ k in) (19) 



P(^kin|j8l) Gkm(jBl) 



P(£pot|j3i) G P ot(j8i 



exp(-[ J 8 2 - J 8 1 ] J E pot ) (20) 



In the case of kinetic energy, Q kin is simply nJLi I-^vi-^pf l m i) d Pi = Ylf=i(^) 3/2 , meaning 
the probability ratio is: 



P(EUh) (|) W/2 exp([ft _ fc]£ldn) (21) 

(k^S)"" 1 <22) 



^(£kin|j6l) 



which is now in terms of the single free parameter A/3 = j8 2 — j8i rather than two parameters. Note 
that this is true for both identical and non-identical particles, since the mass terms will cancel out 
for all i. In the case of kinetic energy, we can obtain a distribution for each distribution alone, 
because the kinetic energy is simply the sum of 3./V random normal variables with standard de- 
viations (m i )~ 1 / 2 p i , and is thus a % 2 distribution with 3./V (minus any center of mass variables 
removed from the simulation) degrees of freedom (DOF). For more than 60 DOF, corresponding 
to about 20 particles, the distribution is essentially indistinguishable from a normal distribution 
with mean equal to the sum of the means of the individual distributions, which in this case is sim- 
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ply the average kinetic energy. By equipartition the total kinetic energy will simply be The 
standard deviation can be computed by noting that the o 2 = ksT 2 Cv, and that the heat capacity 



to high accuracy for any number of molecules typical in molecular simulations. In the above 
formulas, 3N should be replaced by the correct number of DOF if constraints are implemented or 
if any center of mass degrees of freedom are removed. Standard methods for testing the normality 
of distributions with known means and standard deviations can be used, such as inspecting Q-Q 
plots or the Anderson-Darling test. 8 If the number of degrees of freedom is not available, as may be 
the case when one is analyzing data provided by someone else, then this can be estimated from the 
average of the kinetic energy by equipartition as (E^ n ) = ^^(#DOF). If the kinetic energy is not 
equal to this value, then the reported temperature will not even be correct, which should be noticed 
from simpler outputs of the simulation before running any other more sophisticated analysis like 
the procedures described in this paper. 

The kinetic energy distribution, in addition to following the ensemble validation formula, can 
therefore be checked directly as well, though this does not seem to be common practice in molecu- 
lar simulation validation. The potential energy formula can be used to either validate the potential 
energies separately, or can be used for Monte Carlo simulations, where only potential energies are 
defined. It is also possible to perform this separation in terms of ideal gas and canonical partition 
functions, but it does not change the results, as the volume is constant. 

To obtain separability of kinetic and potential energies in an NPT ensemble, we start by writing 
the isobaric-isothermal partition function in terms of kinetic and potential energy portions of the 



due to the kinetic energy is the ideal gas heat capacity, . Thus o 1 



4m> and 




25 



canonical partition functions, and note that the kinetic energy part is independent of the volume. 



a(js,p) = /3p/e kin ( i 8)e pot (/3,y)exp(- J py)jy 

Jv 

A(/3,P) = Q^nPP J v Q vot (P,V)exp(-PV)dV 
A(j8,P) = e km (/3)A pot (/3,y) 



Then in terms of probabilities, we have: 



P(E,V\P,P) 



e k in(j8)- 1 A pot (/3, V)- 1 exp(-/3£ kin - /3£ pot - pPV) 



P(£kinlM 



2 kin ( J 8)- 1 exp(-/3£ kin ) 



P(E poU V\P,P) 



^pot 



This separation again makes it possible to validate NPT Monte Carlo simulations by removing 
the kinetic energy. 

3.2 Molecular dynamics of Lennard- Jones spheres 

We next illustrate of the utility of the ensemble validation formula for molecular simulations. 
For this study, we used a simulation of 300 Lennard- Jones particles using a beta version of the 
Gromacs 4.6 simulation code compiled in double precision. We used the Rowley, Nicholson 
and Parsonage argon parameters for Lennard- Jones spheres (<7 = 0.3405 nm, e = 119.8 K, ks — 
0.996072 kJ/mol), 9 and simulated at p = 0.85p c , meaning the box is of length 3.5328256 nm 
and T = 0.85T C = 135.0226. Velocity Verlet integration was used, with the exception of the Gro- 
macs stochastic integration method, which is only defined for the leapfrog Verlet algorithm. Linear 
center of mass momentum was removed every step, and a long range homogeneous dispersion cor- 
rection was applied to the energy. Unless otherwise specified, a Lennard- Jones switch between 0.8 
and 0.9 nm was used, with a neighborlist at 1.0 nm, a neighborlist update frequency of 5 step, and 
a time step of 8 fs. Temperature coupling algorithms were carried out with a coupling constant of 
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Xj = 1.0 ps. A total of 62.5 million MD steps were simulated for all simulations, equivalent to 500 
ns with a 8 fs time step, with the last 490 ns used for analysis. Unless otherwise specified, the low 
and high temperatures are T = 132.915 and T = 137.138 respectively, chosen to be approximately 
0.7 times the estimated ideal o gap from the rule of thumb, using Cy ~ 8.5 kJ K _1 mol _1 from a 
preliminary simulation of the system. 

3.3 Molecular example: validating temperature control algorithms 

Using this Lennard- Jones system, we first examine temperature control algorithms implemented in 
Gromacs: Bussi-Parrinello, 10 with stochastic scaling of the target temperature, Andersen tempera- 
ture control, 1 1 a variant of Andersen temperature control with the velocity of all atoms randomized 
at some regular interval x t , Nose-Hoover, 2 stochastic dynamics, and Berendsen velocity scaling. 12 
All of these temperature control algorithms are proven in theory to give the correct canonical 
distribution in the limit of long time scales 13 with the exception of the Berendsen temperature al- 
gorithm, which is known to give an incorrect, overly narrow kinetic energy distribution. 14-16 We 
examine the deviations of the total, potential, and kinetic energies, using analytic errors from the 
maximum likelihood fits. In this analysis we will often use the AP and AT (from the maximum 
likelihood expressions) to describe the deviations from the true distribution to make them more 
intuitive. We can calculate AT from A/3 by assuming an average /3 ave = + fh), and calculat- 
ing T2 = k B 1 (/3 ave + A/3 /2) ~ 1 and T\ =k B l (/3 ave — A/3 /2) ~~ 1 . In all molecular simulations, we also 
compute the correlation times % of the energy observables, using the time series module of the 
pymbar code distribution 17 and subsample the data with frequency 2t+ 1 to obtain uncorrelated 
samples. We find that for the kinetic energies alone, the correlation times are actually artificially 
short when using the algorithm in the t imeseries module, which only integrates out to the first 
crossing of the x-axis. We therefore in this study use the correlation times for the potential energies, 
which are equal to longer than the correlation times of the kinetic or total energies. Subsampling 
more frequently than required only affects the results by decreasing the statistical accuracy due 
to collecting to few uncorrelated measurements, which for a validation test is not as large a prob- 
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Table 3: Ensemble validation of different temperature control algorithms All studied ther- 
mostats are consistent with a canonical ensemble, with the exception of the Berendsen thermostat, 
with deviations from the true slope generally 1 a or less. The true slope is 0.027865 kgT~ l , equiv- 
alent to AT = 4.223. All errors are computed using the maximum likelihood method with the 
analytical error estimate. NVE simulations also deviate from the canonical ensemble though the 
potential energy distributions do not statistically deviate. 
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Figure 4: Differences in validation of Berendsen and Nose-Hoover thermostats. Berendsen 
temperature control produces simulations deviating greatly from the true distribution; in this case, 
the slope ~ Pi of the kinetic energy log ratio is 7 times higher than it should be, 68 standard 
deviations away from the true value. The Nose-Hoover thermostat, like most others examined here, 
gives a slope statistically indistinguishable from the proper slope for the kinetic energy portion of 
the canonical ensemble. 

lem as significantly undersampling the statistical error, which results in using correlated data. For 
the thermostat comparison, we use the subsampling frequencies of 40 ps, which is the maximum 
among all methods, except for the Andersen massive variant, for which we use 60 ps. 

This comparison is presented in Table 3, with all estimates and errors computed using max- 



28 



imum likelihood methods described in this paper. We see that all temperature control methods 
appear to be consistent with a canonical ensemble, with the exception of the Berendsen temper- 
ature control method, with deviations from the true slope generally 1 a or less. NVE kinetic 
energy distributions deviate from the canonical ensemble, though interestingly, potential energy 
distributions do not deviate from the correct distribution to a statistically noticeable level. In all 
cases where there are deviations of the kinetic energy, the distributions of the potential energies 
are closer to the true distribution than the kinetic energy or total energy distributions are; as noted, 
for NVE, the potential energy distribution is statistically indistinguishable from the NVT potential 
energy distribution. 

3.3.1 Molecular example: the effect of large step size 

It is well known that step sizes that are too large can lead to rapid heating of a NVE molecular 
dynamics simulation as the integration deviates from the conserved energy trajectory. This de- 
viation was one of the initial motivations leading to the development of thermostats. However, 
using a thermostat to bleed out the extra thermal energy created by violations of the conservation 
of energy effectively creates a steady state system. The system has both heat being both pumped 
in by numerical integration error and pumped out by the thermostat, with the the average kinetic 
energy having the desired average. However, this steady state process does not necessarily have 
the correct Boltzmann probability distribution. 

There has been relatively little investigation of the effect of step size on the ensemble itself 
when temperature control is applied, 18 especially for atomistic simulations. Here, we examine step 
sizes from 8 fs to 40 fs. In the Gromacs code, a step size of 48 fs with Lennard- Jones argon cause 
segmentation faults within just a few ns and therefore represents the upper limit of stability with 
a thermostat coupling constant with %t = 1 ps. In these units, the reduced time is a(M/e) 1//2 = 
0.1245 picoseconds, so the stability limit is about 0.386 reduced time units. 

However, being below the limit of stability does not necessarily mean that the ensemble is 
correctly reproduced. To analyze the distributions generated by long step sizes, we use the Bussi- 
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Table 4: Effect of step size on ensemble consistency. Total and kinetic energy gradually deviate 
from the true ensemble as step size increases, becoming statistically noticeably near, but not at the 
instability point. Potential energy distributions deviate less significantly from a canonical distribu- 
tion than the kinetic energy distributions. The average half step kinetic energy estimator using the 
leapfrog Verlet deviates less from the true distribution. 
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Parrinello thermostat algorithm, and step sizes ranging from 8 fs to 40 fs (Table 4). Uncorrelated 
potential energy samples were 20 ps apart as determined by the time series module, consistent 
over all steps sizes to within 10%. Uncertainties in effective temperature are determined directly 
from the subsampled kinetic energies, rather than using Gromacs g_energy output, in order to 
have a more consistent treatment of uncertainties between different observables. 

In Table 4, we note that total and kinetic energy gradually deviate from the true ensemble 
with the deviation becoming extremely large near the instability point. For this particular system, 
average temperatures determined by averages of the kinetic energy from a simulation (shown for 
the lower temperature simulation in Table 4) are not as useful to distinguish systems that are being 
forced back to the desired average kinetic energy using the thermostat. 

Interestingly, potential energy distributions deviate much less significantly from the canonical 
distribution than the kinetic energy distributions to the extent that is this deviation is not statistically 
significant. This may relate to the fact that standard estimator of the kinetic energy in the velocity 
Verlet algorithm, the sum of the squared full step velocities times the masses, is not as accurate 
as the estimator of the kinetic energy of the leapfrog Verlet algorithm, which uses the averaged 
half- step kinetic energies. Although deviations increase with the square of the step size in both 
cases, the full step kinetic energies deviate more quickly. 19 We note that it appears to be the choice 
of kinetic energy estimator, not the integration method per se that makes a difference, since the 
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two methods give identical NVE trajectories up to numerical precision. The hypothesis that the 
choice of kinetic energy estimator may make a difference was confirmed by performing the same 
40 fs time step simulation with the leapfrog Verlet integrator and the Bussi-Parrinello algorithm, 
resulting in significantly better kinetic energy distribution without statistically altering the potential 
energy distributions. We note that in this case, although the deviation is statistically very clear, it is 
not necessarily that large. Even for the 40 fs step kinetic energy, the fitted temperature difference 
is only off 10%, which is about 0.4 K, which will not make a difference for most applications. We 
also note that simulations of different molecular systems with different potential functions may 
have different deviations from ensemble consistency as a function of distance from the time step 
stability point. 

3.3.2 Example: Examining the effect of cutoffs on ensemble consistency 

An abrupt cutoff of a radial potential function creates a discontinuity in the force, resulting in 
steadily increasing temperature in an NVE simulation. This temperature rise can, as in the case 
of large time step, again be disguised by adding a thermostat, creating a steady state system that 
does not necessarily obey the canonical distribution. We can examine the effect of this truncated 
potential on the NVT ensemble using our ensemble consistency tests. We run the same Lennard- 
Jones argon system with abrupt cutoffs at r c = 2.0o~, 2.5o~, 3.0o~, 3.5c, and 4.0c, where o here is 
the Lennard- Jones size, not the standard deviation. Because of quirks in the way Gromacs handles 
abrupt cutoffs, we create an abrupt cutoff using a potential switch over a distance of 10 9 nm, 
which on the integration time scale effectively becomes a discrete cutoff. We can measure how 
much such a simulation violates conservation of energy by monitoring the average increase in the 
conserved quantity per unit time. In these simulations, we use the Bussi-Parrinello thermostat, with 
Xj = 1.0 ps, approximately 120 times the time step, with 7| ow = 132.915 and r high = 137.138. We 
can measure the magnitude of energy drift by monitoring the change in the conserved quantity over 
time which varies from 9.40 x 10 3 kJ mol -1 ns _1 for r c = 2.0a to 78 kJ mol _1 ns _1 for r c = 4.0a. 
Times between uncorrected samples, as determined by potential energy differences, were no larger 
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Table 5: Effect of abrupt cutoff on ensemble validation. Distributions are surprisingly ensemble 
consistent for most values of abrupt cutoff for Lennard- Jones spheres, with only the shortest cutoff 
distances (less than 3 LJ a) showing statistically clear violations. 



Tiow = 132.915 

r c (LJ cr) Econs (gained kJ/ns) Estimated Ti ow (K) a deviation 


True Ar = 4.223 
Estimated AT a deviation 




total potential kinetic 


2 9400 133.952 ± 0.045 23.0 
2.5 1140 133.043 ± 0.045 2.9 

3 239 132.941 ± 0.045 0.6 
3.5 104 132.930 ± 0.045 0.3 

4 78 132.929 ± 0.045 0.3 


4.102 ±0.058 2.1 
4.206 ± 0.052 0.3 
4.232 ±0.051 0.2 
4.226 ±0.050 0.1 
4.302 ±0.050 1.6 


4.018 ± 0.084 2.4 
4.177 ± 0.059 0.8 
4.291 ± 0.065 1.1 
4.302 ± 0.058 1.4 
4.307 ± 0.057 1.6 


4.122 ±0.070 1.4 
4.176 ±0.071 0.7 
4.213 ±0.071 0.1 
4.135 ±0.070 1.2 
4.192 ±0.071 0.4 



than 25 ps for all systems, so we use this sampling time frequency all three quantities. 

We see in Table 5 that the distributions are surprisingly ensemble consistent for most values of 
abrupt cutoff for Lennard- Jones spheres despite the fact that the simulation is gaining more than 
200 kJ/mol/ns with a 3 a cutoff. We note that in this case, the deviation from desired temperature as 
calculated from average kinetic energy is fairly clear (23 standard deviations for a 2 a cutoff!) and 
therefore this measure appears to be better at distinguishing violations from the correct distribution 
than the ensemble consistency check. This contrasts with the case of varying step size, where the 
ensemble consistency check was more sensitive than the deviation from the correct average kinetic 
energy. Clearly, multiple validation methods should always be performed! 

3.3.3 Validating the gap selection criteria for molecular systems 

Finally, we attempt to validate our rule of thumb for the ideal temperature gap with molecular 
systems, since it was derived for a simplified model system. We test the ability to detect error 
using the same Lennard- Jones argon system with time step At = 32 fs using velocity Verlet, as for 
higher temperatures, a time step of At = 40 fs can crash in simulations extending for hundreds of 
ns long. Measuring the heat capacity as 8.5 kJ mol -1 K _1 at 135 K leads to a standard deviation of 
36 kJ/mol and an estimated ideal temperature gap of 6 K between the means of the two total energy 
distributions. We see (Fig. 6) that we are most sensitive to error in the total energy between 1 and 
2 times the estimated gap, meaning that our analytical guidelines were close, but that a slightly 
larger gap might sometimes be more effective in identifying errors. We note that the kinetic energy 
standard deviation at 135 K (24 kJ/mol) is only about 2/3 of the total energy standard deviation, 
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Table 6: Molecular validation of ideal gap guidelines. We test the temperature gap for maximum 
discrimination rule with Lennard- Jones argon with time step At — 32 fs. Maximum discrimination 
of error in the ensemble consistency for the different energy terms occurs between 1 and 2 times 
the estimated gap rule (column 3). 









Kinetic 




Potential 




Total 




AT/T 


h-Pi 


nxgapop, 


Est. slope 


o~ deviation 


Est. slope 


o~ deviation 


Est. slope 


o~ deviation 


0.0156 


0.01393 


0.4 


0.01302 ± 0.00023 


4.0 


0.01413 ± 0.00021 


0.9 


0.013518 ± 0.00016 


2.6 


0.0313 


0.02787 


0.7 


0.02598 ± 0.00025 


7.7 


0.02585 ± 0.00018 


1.4 


0.026433 ± 0.00018 


7.8 


0.0469 


0.04182 


1.1 


0.03892 ± 0.00027 


10.7 


0.04152 ± 0.00027 


1.1 


0.039990 ± 0.00023 


8.1 


0.0626 


0.05578 


1.4 


0.05190 ±0.00031 


12.6 


0.05526 ±0.00031 


1.8 


0.053343 ± 0.00029 


8.6 


0.0938 


0.08378 


2.1 


0.07791 ±0.00042 


14.2 


0.08290 ± 0.00045 


2.0 


0.079531 ± 0.00049 


8.6 


0.1251 


0.11189 


2.8 


0.10464 ±0.00059 


12.2 


0.11150 ± 0.00071 


1.6 


0.107768 ±0.0010 


4.2 


0.1877 


0.16867 


4.2 


0.1522 ±0.0016 


10.1 


0.1654 ±0.0022 


1.5 


0.165805 ±0.0057 


0.5 


0.2502 


0.22645 


5.6 


0.2099 ± 0.0037 


4.5 


0.220 ± 0.010 


0.6 


0.249 ± 0.072 


0.3 



but since the total heat capacity (8.5 kJ K mol ) is more than twice as large as the ideal gas heat 
capacity (3.72 kJ K 1 mol -1 for this size of system), the kinetic energy distributions have closer 
mean values than the total energy distributions. Thus the range of peak discrimination for kinetic 
energy still falls in the 1 to 2 times the "twice the central standard deviation" rule of thumb, when 
using the distribution of kinetic energies. For molecular systems, the ideal gap might therefore 
be better estimated using a temperature gap 1.5 to 2 times the estimated cap range. However, a 
relatively wide range of values allows discriminating lack of ensemble validity if sufficient data is 
collected. 



3.4 Examining pressure control algorithms 

There are currently three pressure control algorithms implemented in Gromacs: Berendsen, 12 
Parrinello-Rahman, 20 ' 21 and the Martyna-Tuckerman-Tobias-Klein (MTTK) algorithm. 22 ' 23 The 
first two are defined using the leapfrog integrator in Gromacs, and the first and last are defined us- 
ing the velocity Verlet integrator. We next examine the same small argon system for fluctuations of 
enthalpy and volume, and the joint fluctuation of volume and energy. A velocity Verlet integrator 
was used except for Parrinello-Rahman, with At = 8 fs. We set the pressure coupling x p to 5 ps in 
all cases and use P = 90 bar and T = 125 K as the average pressure and temperature, resulting in a 
system well below the critical point. When testing volume fluctuations or joint energy and volume 
fluctuations, a low pressure of 30 bar and a high pressure of 150 bar were used (AP = 120 bar), 



33 



except for Berendsen pressure control, where low and high pressures of 88 and 92 bar (AP = 4 
bar) were used. A lower range is needed for the Berendsen weak coupling algorithm as the volume 
distributions are far smaller than is correct for the distribution (already demonstrating a problem). 
When testing enthalpy fluctuations or joint energy and volume fluctuations, a low temperature of 
121.431 K and a high temperature of 128.569 K (A/3 = 0.054987 k B T~ l , AT = 7.138 K) were 
used, generated using an estimated Cp of 10.2 kJ/mol from short initial simulations for this system 
using the estimated gap formula. For Berendsen thermostat simulations, a temperature range of 
124.108 to 125.892 K was used (A/3 = 0.013736, equivalent to AT = 1.784), as again the overlap 
between the distributions is very poor for wider parameter differences. Nose-Hoover temperature 
control with Xj = 1 ps was used for both Parrinello-Rahman and MTTK algorithms. The AP for 
joint energy and volume comparisons is smaller because the simulations are run at different tem- 
peratures and is equal to AP = 114.861 bar for Parrinello-Rahman and MTTK and 2.715 bar for 
Berendsen. 

Looking at Table 7, we see that the Parrinello-Rahman and MTTK algorithms reproduce very 
accurately the correct enthalpy distributions, deviating very little from the correct A/3, with very 
high statistical confidence. The precision is high partly because the time between uncorrelated 
samples (in this case, determined from the largest correlation time of either the energy or the vol- 
ume) is quite short, in the range of 4-6 ps. The volume distributions, however, are somewhat off, 
with the effective AP in both cases near 115 ± 0.6 instead of 120. For most cases, this will be 
sufficiently accurate to model physical processes (and is far better than the Berendsen results), but 
might not be sufficiently accurate for very high precision thermodynamic measurements. The 9a 
deviation from the true answer is again not necessarily a sign of how bad the simulation is. In 
this case, because the slope is nearly correct, it sign that it is statistically very likely the simulation 
is at least somewhat off rather than simply being very bad. Similar patterns are seen in the joint 
distribution of E and V, where the effective AP is still off by about 5 bar (or around 5%). The de- 
viations are similar for both MTTK and Parrinello-Rahman, even though these integration routines 
are mostly separated in the Gromacs code. 
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Table 7: Ensemble validation of pressure control algorithms Tests of enthalpy distribution, 
volume distribution, and joint energy and volume distributions. The Berendsen barostat fails badly 
in all three tests. The other two barostat give correct enthalpy distributions, but have small (AP off 
by 5 bar or ps 5%) but statistically clear (7-9(7) errors in the volume distributions. 





enthalpy 


volume 




joint energy and volume 




Barostat 


A slope o" deviation 


A slope o~ deviation 


A slope 


a deviation A slope 


o" deviation 


Berendsen 


4.176 ±0.121 19.8 


79.5 ±4.4 17.1 


0.69 ±0.14 


7.6 -318.661 ± 7.322 


43.9 


Parrinello-Rahman 


7.022 ± 0.033 3.5 


114.58 ±0.57 9.5 


7.168 ±0.036 


0.8 110.971 ±0.529 


7.4 


MTTK 


7.105 ±0.029 1.2 


115.51 ±0.50 9.0 


7.152 ± 0.031 


0.5 111.312 ±0.457 


7.8 



For Berendsen, the results are uniformly bad. In all cases, the deviation from the expected 
values is significantly higher than with MTTK or Parrinello-Rahman, with the slopes being much 
further from the true value even though the statistical error is much higher as well. This devi- 
ation exists even though the average temperatures and pressures in the Berendsen case were all 
well within statistical noise. For example, for the joint distribution analysis, the low and high 
average pressures were indeed 87.996 ± 0.003 and 91.998 ± 0.005 bar and the average tempera- 
tures were 125.865 ±0.015 K and 124.081 ±0.015, well within the statistical noise. Errors in the 
fitting parameters are therefore due to unphysically narrow distributions, not the average values 
themselves. We note one other potential strange problem with Berendsen volume control com- 
bined with Bussi-Parrinello thermostat. The autocorrelation times are much longer than with other 
simulation variables, on the order of 20 ps for the energies and 110-130 ps for the volumes. It 
is not clear what exactly is causing such slow change of these variables when the time constants 
themselves are much lower — in this case Xj = 1 ps and Xp = 5 ps — but perhaps indicates another 
reason to avoid Berendsen pressure control. 

3.5 Water simulations 

We also examine a somewhat more typical system for molecular simulation, a small box of 900 
TIP3P water, a size that might be used to compute pure water properties or small molecule sol- 
vation free energies. We again use velocity Verlet integration (with the exception of the Gromacs 
stochastic integration, which is only defined for the leapfrog Verlet algorithm) with linear center of 
mass momentum removal every step and a long range homogeneous dispersion correction applied 
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to the energy and virial. We use a Lennard- Jones switch between 0.8 and 0.9 nm with a neigh- 
borlist at 1.0 nm and particle mesh Ewald electrostatics with cutoff 1.0 nm, PME order 6, and 
Ewald cutoff tolerance of 10~ 6 . In all cases, neighborlist update frequency of 10 steps was used 
with a time step of 2 fs. SETTLE 24 ' 25 was used to constrain the water bonds and angles, and a 
total of 10 million steps (20 ns) were simulated, with the last 19 ns used for analysis. Temperature 
coupling algorithms were carried out with a coupling constant of Xj = 1.0 ps for the NVT simula- 
tions and Xj = 5.0 and tp = 5.0 for the NPT simulations. The low temperature is 298 K and 301 
K, with AT /T = 0.01 estimated from Ce in the total energy from a single short simulation using 
the relationships for the ideal temperature gap. For the NPT simulations, using a oy = 0.25nm 3 at 
1 bar from a short simulation predicts a AP of 238 bar using the formula presented here, but to err 
on the side of having sufficient samples we instead use AP =175 with the low pressure at 1 bar 
and the high at 351 bar, though we are potentially losing some precision. In the case of Berendsen 
pressure control, we used AT = IK, and AP = 30 bar to ensure overlap. For NVT, the interval 
between uncorrelated samples is determined from correlation times of the potential energy which 
is 2 ps for all methods except the Andersen method, where we use 4 ps. For NPT, we use the 
maximum of the uncorrelated sample intervals between the volume and the energy. Correlation 
times for MTTK are much smaller, around 0.3-0.4 ps for both energy and volume, whereas for 
Berendsen the energy and volume uncorrelated sample intervals are both 4 ps, and for Parrinello- 
Rahman, the energy and volume intervals are 6 ps and 0.4 ps, respectively. Thus, the NPT MTTK 
results are somewhat more precise. 

We first examine the NVT results in Table 8. These results are completely in keeping with the 
argon results before, with all temperature control methods well within statistical error, with the 
exception of Berendsen, which is again wildly incorrect. These results demonstrate that the utility 
of ensemble validation is applicable to more typical molecular simulations, with data set sizes that 
are more typical for a standard testing pipeline. 

From the NPT results in Table 9, we see Parrinello-Rahman and MTTK have reasonable per- 
formance in describing the enthalpy distribution. Berendsen in this case is also reasonable, perhaps 
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Table 8: Ensemble validation of different temperature control algorithms with water AT = 3 

K corresponding to a inverse temperature slope of is 0.004023 (ksT)^ 1 . Results are consistent with 
those performed with argon, with all temperature control algorithms ensemble consistent except 
for Berendsen. 





total 




potential 




kinetic 




Thermostat 


Slope 


a deviation 


Slope a deviation 


Slope a deviation 


Berendsen 


51.6 ± 1.1 


44.2 


7.20 ±0.12 


34.7 


4.86 ±0.12 


15.8 


Stochastic 


2.998 ± 0.059 


0.04 


2.944 ± 0.069 


0.8 


3.032 ± 0.090 


0.4 


Nose-Hoover 


2.921 ± 0.058 


1.4 


2.953 ± 0.068 


0.7 


2.837 ± 0.089 


1.8 


Andersen 


3.028 ± 0.083 


0.4 


3.114 ±0.098 


1.2 


2.870 ±0.126 


1.0 


Andersen (Massive) 


3.086 ± 0.083 


1.0 


3.048 ± 0.097 


0.5 


3.136 ±0.127 


1.0 


Bussi-Parrinello 


2.955 ± 0.058 


0.8 


2.956 ± 0.068 


0.6 


3.021 ± 0.090 


0.2 



Table 9: Ensemble validation of pressure control algorithms in water. Tests of enthalpy distri- 
bution, volume distribution, and joint energy and volume distributions. For Parrinello-Rahman and 
MTTK, the true AT = 3 and true AP = 350, while for Berendsen, they are AT = 1 and AP = 30 in 
the joint energy and volume case. The Berendsen barostat performs significantly worse, requiring 
a much narrower range of variables to get any overlap. The other two barostat give statistically 
valid enthalpy distributions, with MTTK appearing to have fairly accurate volume distributions 
and with Parrinello-Rahman having somewhat worse volume behavior. 





enthalpy 


volume 




joint energy and volume 




Barostat 


Slope a deviation 


Slope a deviation 


Slope 


a deviation Slope 


a deviation 


Berendsen 


1.03 ±0.15 0.2 


262 ± 25 9.0 


1.67 ±0.21 


3.3 250 ± 30 


7.4 


Parrinello-Rahman 


2.65 ± 0.21 1.7 


309.3 ±3.7 11.1 


4.09 ± 0.34 


3.2 354 ± 19 


0.2 


MTTK 


2.978 ± 0.053 0.4 


335.7 ±3.9 3.7 


3.026 ± 0.074 


0.4 345.7 ± 4.6 


0.6 
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because the entropy contribution dominates for the nearly incompressible water. MTTK has some- 
what better results for volume fluctuations than Parrinello-Rahman. It is interesting to speculate 
on exactly the source of the difference between the volume fluctuation results in the argon and the 
water examples. In the argon example, both pressure control algorithms had small but statistically 
noticeable errors that were consistent between the two algorithms. In the water example MTTK 
appears to be fairly ensemble consistent, whereas Parrinello-Rahman is slightly worse. Parrinello- 
Rahman with leapfrog is known to be inexact because the pressure lags by one time step, as the 
pressure and temperature are not both known at a given time t until after the next half step. This 
may be more of a problem in the case of water because with a higher compressibility, volume 
integration is a stiffer equation, requiring more exact solutions. We can tentatively conclude that 
typical aqueous simulations using MTTK may be more consistent with an NPT ensemble than 
Parrinello-Rahman, though both are far better than Berendsen temperature control. 

4 Tools 

To make these ensemble consistency checks easier, we have created a set of tools to assist other 
researchers to more easily measure the ensemble validations. This code is hosted by SimTK, at 
http : //simtk . org/home/checkensemble and includes automatic plotting of linear and 
nonlinear graphs, linear, nonlinear, and maximum likelihood parameter analysis. These software 
tools were used for all analysis in this paper. These tools include example code for parsing Gro- 
macs, CHARMM and Desmond output files for ensemble consistency for both NVT and NPT 
simulations, including testing enthalpy, volume, and joint energy and volume fluctuations. Scripts 
to regenerate all the harmonic oscillator analytic tests described in this paper are also included in 
the distribution. 



38 



5 Conclusions 



We have shown that for molecular distributions characterized by Boltzmann distributions, which 
is true for essentially all molecular simulations performed at NVT and NPT, we can easily check 
for consistency with the intended ensemble regardless of the details of the simulation. We simply 
require pairs of simulations with differing external parameters such as temperature, pressure, or 
chemical potential. These paired simulations allow system-dependent properties such as densi- 
ties of states to cancel out, resulting in a linear relationship between the distribution of extensive 
quantities such as energy, volume, enthalpy, and number of particles. Importantly, the constant of 
proportionality in this linear relationship is completely determined by the intensive variables that 
are set by the user. 

Tests of simple model systems shows that these relationships are not only qualitatively useful 
but also that with proper error analysis can provide quantitative validation of the statistics of the 
distributions. We have demonstrated the utility of these relationships with simple analytical toy 
models of harmonic oscillators in both the NVT and NPT ensembles as well as with molecular 
simulations of argon and water. We see that these ensemble consistency relationships are able to 
identify thermostats and barostats that are inconsistent with the ensemble as well as identify dif- 
ferences in distributions caused by long time steps or abrupt cutoffs. All tested thermostats except 
the Berendsen thermostat give statistically good results. Barostats were somewhat more problem- 
atic, with MTTK giving the best results and Parrinello-Rahman being acceptable for many uses, 
while Berendsen pressure control is simply wrong for any calculation where volume fluctuations 
are important. In all cases, simpler checks such as making sure estimators of quantities like the 
temperature and pressure calculated from the kinetic energy and the virial do indeed have the cor- 
rect value are useful as diagnosis tools, and may occasionally identify problems that are not easily 
identified by the ensemble consistency methods tested here. 

These relationships between pair distributions are true for all differences in applied external 
thermodynamic variables. However, there are statistical reasons for choosing specific differences 
in the parameters. We have shown that for simple potentials both small and large differences in 
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the applied system parameters lead to difficulty in distinguishing systems with errors from systems 
with the correct distributions. We have also shown that for typical probability distributions, choos- 
ing distributions whose means are separated by gaps two to four times the sum of the standard 
deviations appears to maximize the ability to discriminate between data that is or is not consistent 
with the desired ensemble, erring on the shorter side in cases where less data might be available. It 
is also important not to underestimate the autocorrelation time for the energy variables to be able 
to accurately use the error estimates, as it may give inaccurately high deviations from the correct 
distribution. Indeed, in typical simulation cases, the ability to properly estimate correlation times 
may be the largest source of uncertainty, as all other parts of calculations are highly robust. We 
also emphasize that the size of the statistical deviation is a measure of how certain we are of the 
discrepancy, not necessarily the size of the discrepancy, as with sufficient data, we can statistically 
identify with a high certainty small deviations that generally do not affect simulation properties 
significantly. Finally, we note that these are very sensitive necessary tests, but they are not suffi- 
cient tests; they cannot guarantee that all states with the same energy are equally sampled, nor can 
they guarantee that all important regions of phase space are sampled. 

We have also developed simple software tools to easily perform the statistical validation dis- 
cussed here, requiring only lists of the relevant extensive variables and specification of the intensive 
applied variables. These tools can be easily incorporated into the workflow for molecular simu- 
lation testing, hopefully greatly reducing the difficulty of determining whether a given algorithm 
or software program is producing the desired thermodynamic ensemble. Future potential improve- 
ments of these tools include adapting the tools for grand canonical simulations and translating the 
relatively unsophisticated quantile validation into full statistical hypothesis testing. 
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A Grand Canonical Ensemble 

Although no grand canonical simulations were carried out in this study, all the equations are es- 
sentially equivalent in the case of the isobaric-isothermal ensemble with — /i taking the place of P 
and ./V taking the place of V. 

P(x,NU3^)=Z(P^r l eM-PE + PLiN) (23) 

Examining the probability of N at fixed /3 and P performed at two different chemical potentials /ii 
and /I2, we obtain 

P(x,N\P,m) = S( J 8,Aii)- 1 e(i8,^V)exp( J 8 1 ^) 
P(x,N\P,n 2 ) = Z(P^ 2 )- l Q(P,N)cxp(p 2 ^N) 

ln Ww^l = + (25) 

We note that in the grand canonical case, N is already discrete, so a histogramming approach intro- 
duces no additional approximations as long as the histograms are fine grained down to integers. For 
samples sizes large enough that larger bins are required for accurate determination of probabilities, 
the maximum likelihood method will be more accurate. 
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We can also treat the joint probability distributions of N and E. 



a{N,E)Z{^^)~ x exp(- J 8 1 £ + frMiAO 
Cl(N, E)S(j6 2 , exp(-/3 2 £ + (h^N) 

exp(-[j8 2 - m + [/3 2M2 - Pm]N) (26) 

-(MPVh - jSi (FVOi) - [02 - J3i]£ + [&M2 - j3i/ii]iV (27) 

This approach can easily be generalized to multiple chemical species, especially when using max- 
imum likelihood methods to allow minimization of the resulting multidimensional probability ra- 
tions. For example, for an arbitrary number of species N with associated chemical potentials jU we 
have: 



P(N,E\p h Lii) = 

P(N,E\lh,H2) = 

P{N,E\k,iL 2 ) 

P(N,E\p h Vi) 

P(N,E\P2,n 2 ) 

P(tf,E|jBi,/ii) 



P(N,E\fa) = ^( J E,/7)S 1 ( J 8 1 , i Ui)" 1 exp(- J 8 1 £ + i 8 li ai-^) 
P(N,E\k) = ^( J E,/7)S 2 ( J 8 2 , i U2)" 1 exp(- J 8 2J E + i 8 2i a 2 -^) 

p(N,E\(h,n2) si(jSi,/ii; 



P(ft,E\p h m) " S 2 (^ 2 ,M2) 



exp(-[j8 2 - J8i]£ + [j8 2i Q 2 - J8i/Ti] • N) (28) 



ln oS^f!« 2,M i = -(J8 2 (W) 2 - J8 X (PV)0 - [J82 - jS!]^ + [J82M2 - J81M1] -iV (29) 

p(N,E\p h m) 



B Weighted least squares fitting to histogram ratios 

Assume we are collecting data from a continuous, one-dimensional probability distribution in a 
histogram H with k = I...K bins. We have N total samples, with {«i,n 2 , . . . observations 
in each bin, so that £fc=i = N. The empirical probability of finding an observation in bin k is 
simply pk = nk/N. Repeating this experiment will lead to slightly different results for the p k . The 
standard estimator of variance of p k due to this sampling variance is a standard result and is equal 
top k (l-p k )/N. 

Given two histograms Hi and H 2 that have aligned bins with N\ and N 2 samples each, the 
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ratio of the probabilities of H2 over H\ will be r k = Pk.i/Pk.2 f° r ea ch bin, where p ki and p k ^ 2 
are the probabilities in the Mi bin for the first and second simulation in the pair. The data in the 
two histograms are collected independently, so the statistical variance in the logarithm of the ratio 



The variance in the ratio of the histograms themselves, useful for computing nonlinear estimates 
of the error will be: 



Define a diagonal weight matrix W, with one over the variance in the z'th measurement along the 
diagonal. If we have a multivariate function F linearly dependent on data vector X as F = AY, 
with A a constant matrix, then the covariance matrix of uncertainties cov (F) will be equal to 
A cov(y)A r . In the case of weighted linear least squares, cov(F) is the matrix of weights W, 
where Wu = o 2 , the variance of each histogram ratio point. If a is the vector of parameters and 
X is the (M + 1) x N matrix of observables, with the first column all ones and the second through 
(M+ l)th column the values of the observations of the M observables, then we will have for a: 



\nr k = In {pk,l/ Pk,\) wli l be to first order: 



var (In r*) 



var(p fc j) var(p M ) 

2 2 

P{\ Pk,2 




(30) 




(3D 



a = (X T WXy l X T WY 
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Plugging this into the equation for cov(a) in terms of cov(F), some linear algebra leads to a 
covariance matrix of the parameters a of (X T WX)~ l . If we have instead a nonlinear least squares 
problem, at the minimum, we obtain a similar covariance matrix, except that we replace X with the 
Jacobian matrix / defined by Jjj = 3^7^ , where / is the nonlinear model, and yi is the z'th data 
point. This leads to a final equation for the covariance of the parameters: 

cov(a) = (J T WJ)~ l 



C Maximum Likelihood Estimation and Analytical Error Es- 
timates 

For a general Boltzmann-type probability distribution the ratio of probabilities must satisfy: 

ln^S =exp(-a-Z) (32) 

where the X are the M sample variables (such as E or V), the M + 1 a, variables are the cor- 
responding conjugate variables specified by the simulation ensemble, and a ■ Xi is shorthand for 
Oq + Ylf=\ a jXj rather than the standard dot product. 

We develop the solution by finding maximum likelihood parameters along the lines of the 
solution presented in (Ref. 5 ). The ratio in Eq. 32 can be interpreted as P(X\ l)/P{X\2) where 
P(X\i) is the conditional probability that an observation is from the rth simulation given only the 
information X. We would like to compute the likelihood of a given set of a parameters given sets 
of measurements with the specific simulation i each set comes from known. 

Using the rules of conditional probabilities, and the fact that P(X\l) -\-P(X\2) = 1, we rewrite 
this probability distribution as follows: 

P(2\X)P(X) 

P(X\2) _ ^ P i^ _ P(2\X)P(1) _ P(2\X) />(!) 
P(X\1) ni^Mg) P(1\X)P(2) \-P(2\X)P{2) 
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We note that P{\)/P{2) = N1/N2, where N2 and N\ are the number of samples from the two 
simulations respectively. Although either P(X\1) or P(X\2) can be eliminated, we are left with 
one independent continuous free energy distribution. Writing either P(X\2) or P(X\1) in a closed 
form is system dependent; specifically, it depends on the unknown density of states. We define the 
constant M = ^(Nf/Nr) and rewrite Eq. 32 as: 

P(2\X) M 
1 _ p | 2| ^ = exp(-M - Ob - £ ajKj) (34) 

Given Eq. 34, we can rewrite the probability of a single measurement P(l \Xi) or P(2\Xi) as: 

1 



l+exp(Af + 3- X f ) 



P(2\Xi) = — (35) 

l+exp(-M — a-Xi) 

The total likelihood of any given observation X ; is the product of all the individual likelihoods, 
giving: 

lnL(a|data) = £ln/(-M- a + £ln/(M+ a ■%) 

i=\ ;=1 

where f(x) = [1 +exp(x)] _1 is the Fermi function. This likelihood equation can be minimized 
directly or by finding the gradient with respect to the a parameters and solving for V(lnL) = to 
give the maximum likelihood result. The log likelihood function has a single minimum, and thus 
there will be only a single root to V(lnL) =0. 

The covariance matrix of each oc 7 can be written in terms of the Fisher information: 

var(a 7 )=/(a J )- 1 = -^^. (36) 

da] 

Note that in Ref., 5 an additional factor dependent on the number of samples was required to get the 
correct uncertainty estimates. In that case, we assumed that the simulation was conducted prop- 
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erly, so that /3 was known and thus had an additional constraint, leaving only a single parameter 
estimated from ratio of two distributions, which ends up reducing the uncertainty by this constant 
factor. 26 In this case, we are solving for two parameters using the data from two distributions, and 
no implicit constraints are applied. Thus the correction is not required. 
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